Preface

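Sigmoid and Softmax are two functions commonly used in logistic regression and neural networks. Beginners are often confused about how they differ and when each one applies, so here is a brief look at both.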

The Sigmoid Function

　　$$ h(x) = \frac{1}{1 + e^{-\theta^Tx}} \tag{1} $$

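Sigmoid is a differentiable, bounded function whose derivative is positive at every point. As $\theta^Tx \rightarrow \infty$, $h(x) \rightarrow 1$; as $\theta^Tx \rightarrow -\infty$, $h(x) \rightarrow 0$. It is commonly used for binary classification problems and as an activation function in neural networks, where it turns a linear input into a nonlinear output.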
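To make the limits and the derivative concrete, here is a minimal sketch in plain Python. The names `sigmoid` and `sigmoid_grad` are made up for illustration, and the scalar argument `t` stands in for $\theta^Tx$. Branching on the sign of `t` is a common precaution against overflow in `math.exp`; it is an addition here, not part of the formula above.

```python
import math

def sigmoid(t: float) -> float:
    """Logistic sigmoid of a scalar t, where t plays the role of theta^T x."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def sigmoid_grad(t: float) -> float:
    """Derivative of the sigmoid, using the identity h'(t) = h(t) * (1 - h(t))."""
    s = sigmoid(t)
    return s * (1.0 - s)

# Evaluate at a few points to see the output move from near 0 to near 1.
for t in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(t, sigmoid(t), sigmoid_grad(t))
```

At `t = 0` the output is exactly 0.5 and the derivative reaches its maximum of 0.25, while for large positive or negative `t` the output saturates toward 1 or 0.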

The Softmax Function

　$$ h(z)_j = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}}, \quad j = 1, 2, \dots, K \tag{2}$$

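Given any real-valued vector of length K, Softmax squashes it into a real-valued vector of length K whose elements lie in the interval $(0, 1)$ and sum to 1. It is widely used in multiclass classification and in neural networks. Softmax differs from the ordinary "max" function: "max" outputs only the largest value, whereas Softmax makes sure that smaller values still receive (smaller) probabilities rather than being discarded outright, which makes it a "soft" version of "max".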
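The contrast with a hard "max" can be seen in a short sketch, again in plain Python. The function name `softmax` and the sample scores are illustrative assumptions. Subtracting the largest score before exponentiating is a standard numerical-stability trick that leaves the result unchanged; it is not part of the formula as written above.

```python
import math

def softmax(z):
    """Map a list of real scores to positive values that sum to 1."""
    # Shifting by the maximum avoids overflow and does not change the ratios.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 2.0, 3.0]   # hypothetical input vector, K = 3
probs = softmax(scores)

print(probs)        # every entry lies in (0, 1); larger scores get larger shares
print(sum(probs))   # the entries sum to 1 up to rounding
print(max(scores))  # a hard max keeps only the largest score
```

The smallest score still receives a nonzero probability, whereas `max` discards it entirely.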

The Binary Classification Case

For Sigmoid, we have:

$$p(y = 1 |x) = \frac{1}{1 + e^{-\theta^Tx}} \tag{3}$$

$$p(y=0|x) = 1- p(y=1|x) = \frac{e^{-\theta^Tx}}{1 + e^{-\theta^Tx}} \tag{4} $$

For Softmax, when $K = 2$, we have:

$$p(y = 1 |x) = \frac{e^{\theta_1^Tx}}{e^{\theta_0^Tx} + e^{\theta_1^Tx}} = \frac{1}{1 + e^{(\theta_0 - \theta_1)^Tx}} = \frac{1}{1 + e^{-\beta^Tx}} \tag{5}$$

$$p(y = 0 |x) = \frac{e^{\theta_0^Tx}}{e^{\theta_0^Tx} + e^{\theta_1^Tx}} = \frac{e^{(\theta_0 - \theta_1)^Tx}}{1 + e^{(\theta_0 - \theta_1)^Tx}} = \frac{e^{-\beta^Tx}}{1 + e^{-\beta^Tx}} \tag{6}$$

where

$$\beta = -(\theta_0 -\theta_1) \tag{7}$$

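So in the binary classification case, Softmax reduces to Sigmoid, with $\theta_1 - \theta_0$ playing the role of Sigmoid's parameter vector.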
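The reduction can also be checked numerically. The sketch below uses plain Python with made-up values for `theta0`, `theta1` and `x`, and the helper names `dot`, `sigmoid` and `softmax` are illustrative. It compares a two-class Softmax against a Sigmoid evaluated with $\beta = \theta_1 - \theta_0$.

```python
import math

def sigmoid(t):
    """Logistic sigmoid of a scalar."""
    return 1.0 / (1.0 + math.exp(-t))

def softmax(z):
    """Softmax over a list of scores, shifted by the max for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

# Hypothetical parameters and input, chosen only for illustration.
theta0 = [0.5, -1.0]
theta1 = [1.5, 2.0]
x = [0.3, 0.7]

# beta = -(theta0 - theta1) = theta1 - theta0
beta = [t1 - t0 for t0, t1 in zip(theta0, theta1)]

p_softmax = softmax([dot(theta0, x), dot(theta1, x)])  # [p(y=0|x), p(y=1|x)]
p1_sigmoid = sigmoid(dot(beta, x))                     # p(y=1|x) via Sigmoid

print(p_softmax[1], p1_sigmoid)        # the two values agree up to rounding
print(p_softmax[0], 1.0 - p1_sigmoid)  # and so do the complements
```

Only the difference between the two parameter vectors matters, which is why the two-class Softmax carries one redundant set of parameters compared with Sigmoid.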

Hiahiahia… Afterword

On questions about the softmax function, 沐神 recommended a blog post that @pluskid wrote years ago, which explains the issue.

Enjoy it!!